Partitioning Algorithms for Improving Efficiency of Topic Modeling Parallelization

Abstract

Topic modeling is a very powerful technique in data analysis and data mining, but it is generally slow. Many parallelization approaches have been proposed to speed up the learning process. However, they are usually not very efficient because of the many kinds of overhead, especially the load-balancing problem. We address this problem by proposing three partitioning algorithms, which either run more quickly or achieve better load balance than current partitioning algorithms. These algorithms can easily be extended to improve parallelization efficiency on other topic models similar to LDA, e.g., Bag of Timestamps, which is an extension of LDA with time information. We evaluate these algorithms on two popular datasets, NIPS and NYTimes. We also build a dataset containing over 1,000,000 scientific publications in the computer science domain from 1951 to 2010 to experiment with Bag of Timestamps parallelization, which we design to demonstrate the proposed algorithms' extensibility. The results strongly confirm the advantages of these algorithms.
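The abstract does not describe the three proposed algorithms, so the following is only a sketch of the load-balancing problem itself, under assumptions that are not in the source. It assumes the document-word count matrix is split into P document groups and P word groups, and that in each of P epochs worker m processes the block (document group m, word group (m + l) mod P), so an epoch takes as long as its heaviest block. The function names `greedy_groups` and `load_balance_ratio` are hypothetical, and the greedy heuristic is a generic baseline, not one of the paper's algorithms.

```python
import random


def greedy_groups(weights, p):
    """Assign items to p groups, heaviest first, always into the lightest group.

    Generic baseline heuristic (an assumption, not taken from the paper).
    """
    groups = [[] for _ in range(p)]
    loads = [0] * p
    for idx in sorted(range(len(weights)), key=lambda i: -weights[i]):
        g = loads.index(min(loads))  # currently lightest group
        groups[g].append(idx)
        loads[g] += weights[idx]
    return groups


def load_balance_ratio(counts, doc_groups, word_groups):
    """Return eta = ideal cost / actual cost, a value in (0, 1].

    counts[d][w] is the count of word w in document d.
    Assumed schedule: in epoch l, worker m handles block
    (doc group m, word group (m + l) % p); an epoch costs as much
    as its heaviest block.
    """
    p = len(doc_groups)
    block = [
        [sum(counts[d][w] for d in dg for w in wg) for wg in word_groups]
        for dg in doc_groups
    ]
    actual = sum(max(block[m][(m + l) % p] for m in range(p)) for l in range(p))
    total = sum(sum(row) for row in block)
    return (total / p) / actual


if __name__ == "__main__":
    random.seed(0)
    n_docs, n_words, p = 40, 60, 4
    # Synthetic, skewed counts purely for illustration.
    counts = [
        [random.randint(0, 3) * random.randint(0, 3) for _ in range(n_words)]
        for _ in range(n_docs)
    ]
    doc_len = [sum(row) for row in counts]
    word_freq = [sum(counts[d][w] for d in range(n_docs)) for w in range(n_words)]
    eta = load_balance_ratio(
        counts, greedy_groups(doc_len, p), greedy_groups(word_freq, p)
    )
    print(f"load-balance ratio: {eta:.3f}")
```

A ratio of 1 means every block in every epoch carries the same load, and lower values mean workers sit idle waiting for the heaviest block. Under these assumptions, a partitioning algorithm can be judged by how close it brings the ratio to 1 and by how long it takes to compute the partition.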